• DOMAIN: Automobile
• CONTEXT: The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5
continuous attributes
• DATA DESCRIPTION: The data concerns city-cycle fuel consumption in miles per gallon
• Attribute Information:
• PROJECT OBJECTIVE: The goal is to cluster the data, treat each cluster as an individual dataset, and train regression models to predict ‘mpg’
Steps and tasks:
• Import all the given datasets and explore their shape and size. • Merge all datasets into one and explore the final shape and size. • Export the final dataset and store it on the local machine in .csv, .xlsx and .json format for future use. • Import the data from the above steps back into Python.
Data cleansing: • Missing/incorrect value treatment • Drop attribute/s if required, using relevant functional knowledge • Perform any other kind of correction/treatment on the data.
Data analysis & visualisation: • Perform detailed statistical analysis on the data. • Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis. Hint: Use your best analytical approach. You can even mix and match columns to create new columns that can be used for better analysis. Create your own features if required. Be highly experimental and analytical here to find hidden patterns.
Machine learning: • Use K Means and Hierarchical clustering to find out the optimal number of clusters in the data. • Share your insights about the difference in using these two methods.
Answer below questions based on outcomes of using ML based methods. • Mention how many optimal clusters are present in the data and what could be the possible reason behind it. • Use linear regression model on different clusters separately and print the coefficients of the models individually • How using different models for different clusters will be helpful in this case and how it will be different than using one single model without clustering? Mention how it impacts performance and prediction.
Improvisation: • Detailed suggestions for improvements on the quality, quantity, variety, velocity, veracity etc. of the data points collected by the company, to enable better data analysis in future.
import numpy as np
from sklearn.linear_model import LinearRegression
from scipy import stats
from scipy.stats import zscore
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.cluster.hierarchy import fcluster
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(color_codes=True)
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
#Loading the JSON file and concatenating it with the car names
Data=pd.read_json ("https://lms-uploads.s3.ap-southeast-1.amazonaws.com/account_1/attachments/2056422/Part1%20-%20Car-Attributes.json?response-content-disposition=attachment%3B%20filename%3D%22Part1%20-%20Car-Attributes.json%22%3B%20filename%2A%3DUTF-8%27%27Part1%2520%252D%2520Car%252DAttributes.json&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIA434J2A76JGUD2PBI%2F20210731%2Fap-southeast-1%2Fs3%2Faws4_request&X-Amz-Date=20210731T054412Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=f91b802e82f43b56d0ff76901d8d147d822f262d95c5ff0b8203b5475dbb66f3")
Data1=pd.read_csv("https://lms-uploads.s3.ap-southeast-1.amazonaws.com/account_1/attachments/2056424/Part1%20-%20Car%20name.csv?response-content-disposition=attachment%3B%20filename%3D%22Part1%20-%20Car%20name.csv%22%3B%20filename%2A%3DUTF-8%27%27Part1%2520%252D%2520Car%2520name.csv&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIA434J2A76JGUD2PBI%2F20210731%2Fap-southeast-1%2Fs3%2Faws4_request&X-Amz-Date=20210731T054422Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=7c55576c901fce350498fe3c4d6c35f56ab0797c2efb44fbd0852c8187d5ab05")
cData=pd.concat([Data,Data1],axis=1)
cData.head()
| | mpg | cyl | disp | hp | wt | acc | yr | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
row, column = cData.shape
print('The dataset contains', row, 'rows and', column, 'columns')
The dataset contains 398 rows and 9 columns
cData.to_csv('mpg.csv', index=False)
cData.to_excel('mpg.xlsx', index = False)
cData.to_json('mpg.json', orient = 'split', compression = 'infer', index = True)
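The task list also asks to import the exported data back into Python, which the notebook does not show. A minimal round-trip sketch on a small stand-in frame (in the notebook this would be `cData`; the `.xlsx` copy is read back the same way with `pd.read_excel`, which needs `openpyxl`):

```python
import pandas as pd

# Stand-in frame; in the notebook this would be cData
df = pd.DataFrame({'mpg': [18.0, 15.0], 'cyl': [8, 8]})
df.to_csv('mpg.csv', index=False)
df.to_json('mpg.json', orient='split')

# Re-import: orient must match the orient used when writing
df_csv = pd.read_csv('mpg.csv')
df_json = pd.read_json('mpg.json', orient='split')

# Both round-trips preserve the shape of the original frame
assert df_csv.shape == df_json.shape == df.shape
```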
#dropping/ignoring car_name
cData = cData.drop('car_name', axis=1)
# Also replacing the numeric origin codes with region names
cData['origin'] = cData['origin'].replace({1: 'america', 2: 'europe', 3: 'asia'})
cData.head()
| | mpg | cyl | disp | hp | wt | acc | yr | origin |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | america |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | america |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | america |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | america |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | america |
print('The data type of each attribute: \n')
cData.info()
The data type of each attribute:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   mpg     398 non-null    float64
 1   cyl     398 non-null    int64
 2   disp    398 non-null    float64
 3   hp      398 non-null    object
 4   wt      398 non-null    int64
 5   acc     398 non-null    float64
 6   yr      398 non-null    int64
 7   origin  398 non-null    object
dtypes: float64(3), int64(3), object(2)
memory usage: 25.0+ KB
cData.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| mpg | 398.0 | 23.514573 | 7.815984 | 9.0 | 17.500 | 23.0 | 29.000 | 46.6 |
| cyl | 398.0 | 5.454774 | 1.701004 | 3.0 | 4.000 | 4.0 | 8.000 | 8.0 |
| disp | 398.0 | 193.425879 | 104.269838 | 68.0 | 104.250 | 148.5 | 262.000 | 455.0 |
| wt | 398.0 | 2970.424623 | 846.841774 | 1613.0 | 2223.750 | 2803.5 | 3608.000 | 5140.0 |
| acc | 398.0 | 15.568090 | 2.757689 | 8.0 | 13.825 | 15.5 | 17.175 | 24.8 |
| yr | 398.0 | 76.010050 | 3.697627 | 70.0 | 73.000 | 76.0 | 79.000 | 82.0 |
# check which 'hp' entries are not purely numeric using isdigit()
hpIsDigit = pd.DataFrame(cData.hp.str.isdigit())
# rows where isdigit() is False
cData[hpIsDigit['hp'] == False]
| | mpg | cyl | disp | hp | wt | acc | yr | origin |
|---|---|---|---|---|---|---|---|---|
| 32 | 25.0 | 4 | 98.0 | ? | 2046 | 19.0 | 71 | america |
| 126 | 21.0 | 6 | 200.0 | ? | 2875 | 17.0 | 74 | america |
| 330 | 40.9 | 4 | 85.0 | ? | 1835 | 17.3 | 80 | europe |
| 336 | 23.6 | 4 | 140.0 | ? | 2905 | 14.3 | 80 | america |
| 354 | 34.5 | 4 | 100.0 | ? | 2320 | 15.8 | 81 | europe |
| 374 | 23.0 | 4 | 151.0 | ? | 3035 | 20.5 | 82 | america |
# Replace '?' placeholders with NaN
cData = cData.replace('?', np.nan)
cData[hpIsDigit['hp'] == False]
| | mpg | cyl | disp | hp | wt | acc | yr | origin |
|---|---|---|---|---|---|---|---|---|
| 32 | 25.0 | 4 | 98.0 | NaN | 2046 | 19.0 | 71 | america |
| 126 | 21.0 | 6 | 200.0 | NaN | 2875 | 17.0 | 74 | america |
| 330 | 40.9 | 4 | 85.0 | NaN | 1835 | 17.3 | 80 | europe |
| 336 | 23.6 | 4 | 140.0 | NaN | 2905 | 14.3 | 80 | america |
| 354 | 34.5 | 4 | 100.0 | NaN | 2320 | 15.8 | 81 | europe |
| 374 | 23.0 | 4 | 151.0 | NaN | 3035 | 20.5 | 82 | america |
#instead of dropping the rows, let's replace the missing values with the median value
cData.median()
mpg       23.0
cyl        4.0
disp     148.5
hp        93.5
wt      2803.5
acc       15.5
yr        76.0
dtype: float64
#convert hp to numeric and replace missing values with the median
cData['hp'] = pd.to_numeric(cData['hp'])
cData['hp'].fillna(cData['hp'].median(), inplace=True)
print('The data set has no missing values \n')
cData.isnull().sum()
The data set has no missing values
mpg       0
cyl       0
disp      0
hp        0
wt        0
acc       0
yr        0
origin    0
dtype: int64
cData['mpg_level'] = cData['mpg'].apply(lambda x: 'low' if x<17 else 'high' if x>29 else 'medium')
cData.head()
| | mpg | cyl | disp | hp | wt | acc | yr | origin | mpg_level |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | america | medium |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | america | low |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | america | medium |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | america | low |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | america | medium |
#creating the mpg_level attribute using approximately the 25% and 75% quantiles of mpg (< 17 as low, 17–29 as medium, > 29 as high)
#categorical variable
cData_cat = cData.iloc[:,[1,6,7,8]]
cData_cat.head()
| | cyl | yr | origin | mpg_level |
|---|---|---|---|---|
| 0 | 8 | 70 | america | medium |
| 1 | 8 | 70 | america | low |
| 2 | 8 | 70 | america | medium |
| 3 | 8 | 70 | america | low |
| 4 | 8 | 70 | america | medium |
#numeric variables
cData_num=cData.drop(['cyl','yr','origin','mpg_level'],axis=1)
cData_num.head()
| | mpg | disp | hp | wt | acc |
|---|---|---|---|---|---|
| 0 | 18.0 | 307.0 | 130.0 | 3504 | 12.0 |
| 1 | 15.0 | 350.0 | 165.0 | 3693 | 11.5 |
| 2 | 18.0 | 318.0 | 150.0 | 3436 | 11.0 |
| 3 | 16.0 | 304.0 | 150.0 | 3433 | 12.0 |
| 4 | 17.0 | 302.0 | 140.0 | 3449 | 10.5 |
#plotting categorical variables
fig = plt.figure(1, (14, 8))
for i, col in enumerate(cData_cat.columns):
    ax = plt.subplot(2,2,i+1)
    sns.countplot(cData_cat[col], order=cData_cat[col].value_counts().index)
    ax.set_xlabel(None)
    ax.set_title(f'Distribution of {col}')
plt.tight_layout()
plt.show()
#plot histograms
cData_num.hist(bins = 20, figsize = (10, 8), color = 'blue')
plt.show()
#plot density
plt.figure(figsize=(17, 13))
col = 1
for i in cData_num.columns:
    plt.subplot(3, 3, col)
    sns.distplot(cData_num[i], color = 'b')
    col += 1
#joining the categorical and numerical variables
cData=pd.concat([cData_cat,cData_num],axis=1)
#checking for attribute type
cData.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 398 entries, 0 to 397 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 cyl 398 non-null int64 1 yr 398 non-null int64 2 origin 398 non-null object 3 mpg_level 398 non-null object 4 mpg 398 non-null float64 5 disp 398 non-null float64 6 hp 398 non-null float64 7 wt 398 non-null int64 8 acc 398 non-null float64 dtypes: float64(4), int64(3), object(2) memory usage: 28.1+ KB
#create dummy variables for mpg_level and origin, and create separate datasets for K-Means and hierarchical clustering
cData = pd.get_dummies(cData, columns=['origin'])
cData = pd.get_dummies(cData, columns=['mpg_level'])
cData.head()
cDataH=cData.copy()
cDataK=cData.copy()
#pair plot for the numeric attributes
cData_attr = cData.iloc[:, 0:7]
sns.pairplot(cData_attr, diag_kind='kde');
#dropping the created dummy variables
cData2=cData.drop(['origin_america','origin_asia','origin_europe','mpg_level_high','mpg_level_low','mpg_level_medium'],axis=1)
#checking for outliers
plt.figure(figsize=(25, 20))
col = 1
for i in cData2.columns:
    plt.subplot(6, 6, col)
    sns.boxplot(cData2[i],color='blue')
    col += 1
#capping outliers at the IQR fences (Q1 - 1.5*IQR and Q3 + 1.5*IQR)
IQR1 = stats.iqr(cData2['hp'], interpolation = 'midpoint')
IQR2 = stats.iqr(cData2['acc'], interpolation = 'midpoint')
#Horsepower after capping outliers at the upper fence
Q3 = cData2['hp'].quantile(0.75)
cData2['hp'] = np.where(cData2['hp'] > (Q3 + 1.5*IQR1), Q3 + 1.5*IQR1, cData2['hp'])
sns.boxplot(cData2['hp']);
#acceleration after capping outliers at both fences
Q1 = cData2['acc'].quantile(0.25)
Q31 = cData2['acc'].quantile(0.75)
cData2['acc'] = np.where(cData2['acc'] > (Q31 + 1.5*IQR2), Q31 + 1.5*IQR2, cData2['acc'])
cData2['acc'] = np.where(cData2['acc'] < (Q1 - 1.5*IQR2), Q1 - 1.5*IQR2, cData2['acc'])
sns.boxplot(cData2['acc'])
plt.figure(figsize=(20,6))
plt.subplot(1, 3, 1)
plt.title('mpg')
sns.distplot(cData2['mpg'],color='red')
# subplot 2
plt.subplot(1, 3, 2)
plt.title('disp')
sns.distplot(cData2['disp'],color='blue')
# subplot 3
plt.subplot(1, 3, 3)
plt.title('acc')
sns.distplot(cData2['acc'],color='green')
plt.figure(figsize=(20,6))
plt.subplot(1, 3, 1)
plt.title('mpg')
sns.boxplot(cData2['mpg'],orient='vertical',color='red')
# subplot 2
plt.subplot(1, 3, 2)
plt.title('disp')
sns.boxplot(cData2['disp'],orient='vertical',color='blue')
# subplot 3
plt.subplot(1, 3, 3)
plt.title('acc')
sns.boxplot(cData2['acc'],orient='vertical',color='green')
plt.show()
sns.countplot(cData2['mpg']);
# plot the correlation coefficients as a heatmap
#checking for correlation
plt.figure(figsize=(10,8))
corr=cData2.corr()
sns.heatmap(corr,annot=True);
#Except year, most of the variables are correlated with each other
#separating numeric variables
cc = cData.iloc[:,0:7]
cc.head()
| | cyl | yr | mpg | disp | hp | wt | acc |
|---|---|---|---|---|---|---|---|
| 0 | 8 | 70 | 18.0 | 307.0 | 130.0 | 3504 | 12.0 |
| 1 | 8 | 70 | 15.0 | 350.0 | 165.0 | 3693 | 11.5 |
| 2 | 8 | 70 | 18.0 | 318.0 | 150.0 | 3436 | 11.0 |
| 3 | 8 | 70 | 16.0 | 304.0 | 150.0 | 3433 | 12.0 |
| 4 | 8 | 70 | 17.0 | 302.0 | 140.0 | 3449 | 10.5 |
#scaling the variables
cc_z = cc.apply(zscore)
cc_z.head()
| | cyl | yr | mpg | disp | hp | wt | acc |
|---|---|---|---|---|---|---|---|
| 0 | 1.498191 | -1.627426 | -0.706439 | 1.090604 | 0.673118 | 0.630870 | -1.295498 |
| 1 | 1.498191 | -1.627426 | -1.090751 | 1.503514 | 1.589958 | 0.854333 | -1.477038 |
| 2 | 1.498191 | -1.627426 | -0.706439 | 1.196232 | 1.197027 | 0.550470 | -1.658577 |
| 3 | 1.498191 | -1.627426 | -0.962647 | 1.061796 | 1.197027 | 0.546923 | -1.295498 |
| 4 | 1.498191 | -1.627426 | -0.834543 | 1.042591 | 0.935072 | 0.565841 | -1.840117 |
#computing the linkage matrix using the average linkage method
link_method = linkage(cc_z.iloc[:,0:7], method = 'average')
#plotting the H-cluster
plt.figure(figsize=(25, 10))
dendrogram(link_method)
plt.show()
# truncated dendrogram, cutting the tree at the last 2 merged clusters
dendrogram(
link_method,
truncate_mode='lastp',
p=2,
)
plt.show()
#viewing the clusters formed
clusters = fcluster(link_method, 2, criterion='maxclust')
clusters
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1,
1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1,
1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2,
2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2,
2, 1, 2, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1,
1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1,
1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2], dtype=int32)
#attaching the clusters formed to the scaled data
cc_z['clusters_H'] = clusters
cc_z.head()
| | cyl | yr | mpg | disp | hp | wt | acc | clusters_H |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.498191 | -1.627426 | -0.706439 | 1.090604 | 0.673118 | 0.630870 | -1.295498 | 1 |
| 1 | 1.498191 | -1.627426 | -1.090751 | 1.503514 | 1.589958 | 0.854333 | -1.477038 | 1 |
| 2 | 1.498191 | -1.627426 | -0.706439 | 1.196232 | 1.197027 | 0.550470 | -1.658577 | 1 |
| 3 | 1.498191 | -1.627426 | -0.962647 | 1.061796 | 1.197027 | 0.546923 | -1.295498 | 1 |
| 4 | 1.498191 | -1.627426 | -0.834543 | 1.042591 | 0.935072 | 0.565841 | -1.840117 | 1 |
#viewing the distribution of clusters
cc_z.clusters_H.value_counts().sort_index()
1    100
2    298
Name: clusters_H, dtype: int64
#attaching the clusters formed to the original data
cc['clusters_H']=clusters
cDataH['clusters_H']=clusters
cc.head()
| | cyl | yr | mpg | disp | hp | wt | acc | clusters_H |
|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 70 | 18.0 | 307.0 | 130.0 | 3504 | 12.0 | 1 |
| 1 | 8 | 70 | 15.0 | 350.0 | 165.0 | 3693 | 11.5 | 1 |
| 2 | 8 | 70 | 18.0 | 318.0 | 150.0 | 3436 | 11.0 | 1 |
| 3 | 8 | 70 | 16.0 | 304.0 | 150.0 | 3433 | 12.0 | 1 |
| 4 | 8 | 70 | 17.0 | 302.0 | 140.0 | 3449 | 10.5 | 1 |
#create a new dataset named Hclus
Hclus=cc
Hclus.head()
| | cyl | yr | mpg | disp | hp | wt | acc | clusters_H |
|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 70 | 18.0 | 307.0 | 130.0 | 3504 | 12.0 | 1 |
| 1 | 8 | 70 | 15.0 | 350.0 | 165.0 | 3693 | 11.5 | 1 |
| 2 | 8 | 70 | 18.0 | 318.0 | 150.0 | 3436 | 11.0 | 1 |
| 3 | 8 | 70 | 16.0 | 304.0 | 150.0 | 3433 | 12.0 | 1 |
| 4 | 8 | 70 | 17.0 | 302.0 | 140.0 | 3449 | 10.5 | 1 |
#aggregating the numeric variables by cluster using the mean
aggdata=cc.iloc[:,0:8].groupby('clusters_H').mean()
aggdata['Freq']=cc.clusters_H.value_counts().sort_index()
aggdata
| | cyl | yr | mpg | disp | hp | wt | acc | Freq |
|---|---|---|---|---|---|---|---|---|
| clusters_H | ||||||||
| 1 | 7.980000 | 73.740000 | 14.684000 | 345.470000 | 160.400000 | 4121.560000 | 12.702000 | 100 |
| 2 | 4.607383 | 76.771812 | 26.477852 | 142.404362 | 85.479866 | 2584.137584 | 16.529866 | 298 |
#Clearly shows two distinct groups, with a clear difference in the cluster means across variables
#plotting the clusters formed
plt.figure(figsize=(10, 8))
sns.scatterplot(x="mpg", y="hp", hue="clusters_H",
data=cc_z,
palette=['green','brown']);
#separating the numeric variables
cc = cData.iloc[:,0:7]
cc_z1 = cc.apply(zscore)
cc_z1.head()
| | cyl | yr | mpg | disp | hp | wt | acc |
|---|---|---|---|---|---|---|---|
| 0 | 1.498191 | -1.627426 | -0.706439 | 1.090604 | 0.673118 | 0.630870 | -1.295498 |
| 1 | 1.498191 | -1.627426 | -1.090751 | 1.503514 | 1.589958 | 0.854333 | -1.477038 |
| 2 | 1.498191 | -1.627426 | -0.706439 | 1.196232 | 1.197027 | 0.550470 | -1.658577 |
| 3 | 1.498191 | -1.627426 | -0.962647 | 1.061796 | 1.197027 | 0.546923 | -1.295498 |
| 4 | 1.498191 | -1.627426 | -0.834543 | 1.042591 | 0.935072 | 0.565841 | -1.840117 |
#calculating the within-cluster sum of squares (WSS)
wss =[]
for i in range(1,5):
    KM = KMeans(n_clusters=i)
    KM.fit(cc_z1)
    wss.append(KM.inertia_)
wss
[2785.9999999999995, 1294.8418950727323, 946.019790855379, 738.3923281527318]
#plotting the WSS against the number of clusters to find the optimal number using the elbow method
plt.plot(range(1,5), wss);
plt.title('Elbow Method');
plt.xlabel("Number of Clusters")
plt.ylabel("WSS");
k_means = KMeans(n_clusters = 2)
k_means.fit(cc_z1)
labels = k_means.labels_
silhouette_score(cc_z1,labels)
0.48235946103916116
#calculating the silhouette score for different numbers of clusters
kmeans_kwargs = {
"init": "random",
"n_init": 10,
"max_iter": 300,
"random_state": 42,
}
silhouette_coefficients = []
# Notice start at 2 clusters for silhouette coefficient
for k in range(2, 7):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(cc_z1)
    score = silhouette_score(cc_z1,kmeans.labels_)
    silhouette_coefficients.append(score)
#plotting the silhouette score against the number of clusters
plt.plot(range(2, 7), silhouette_coefficients)
plt.xticks(range(2, 7))
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Coefficient")
plt.show()
#attaching the labels to the datasets
cc["cluster_K"] = labels
cDataK['cluster_K']=labels
Kclus=cc
Kclus.head()
| | cyl | yr | mpg | disp | hp | wt | acc | cluster_K |
|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 70 | 18.0 | 307.0 | 130.0 | 3504 | 12.0 | 0 |
| 1 | 8 | 70 | 15.0 | 350.0 | 165.0 | 3693 | 11.5 | 0 |
| 2 | 8 | 70 | 18.0 | 318.0 | 150.0 | 3436 | 11.0 | 0 |
| 3 | 8 | 70 | 16.0 | 304.0 | 150.0 | 3433 | 12.0 | 0 |
| 4 | 8 | 70 | 17.0 | 302.0 | 140.0 | 3449 | 10.5 | 0 |
#viewing the distribution of clusters
cc.cluster_K.value_counts().sort_index()
0    105
1    293
Name: cluster_K, dtype: int64
#attaching the clusters to the scaled data
cc_z1["cluster_K"] = labels
cc_z1.head()
| | cyl | yr | mpg | disp | hp | wt | acc | cluster_K |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.498191 | -1.627426 | -0.706439 | 1.090604 | 0.673118 | 0.630870 | -1.295498 | 0 |
| 1 | 1.498191 | -1.627426 | -1.090751 | 1.503514 | 1.589958 | 0.854333 | -1.477038 | 0 |
| 2 | 1.498191 | -1.627426 | -0.706439 | 1.196232 | 1.197027 | 0.550470 | -1.658577 | 0 |
| 3 | 1.498191 | -1.627426 | -0.962647 | 1.061796 | 1.197027 | 0.546923 | -1.295498 | 0 |
| 4 | 1.498191 | -1.627426 | -0.834543 | 1.042591 | 0.935072 | 0.565841 | -1.840117 | 0 |
#aggregating the numeric variables by cluster using the mean
aggdata=cc.iloc[:,0:8].groupby('cluster_K').mean()
aggdata['Freq']=cc.cluster_K.value_counts().sort_index()
aggdata
| | cyl | yr | mpg | disp | hp | wt | acc | Freq |
|---|---|---|---|---|---|---|---|---|
| cluster_K | ||||||||
| 0 | 7.923810 | 73.742857 | 14.851429 | 341.809524 | 158.000000 | 4093.771429 | 12.867619 | 105 |
| 1 | 4.569966 | 76.822526 | 26.619113 | 140.250853 | 85.061433 | 2567.860068 | 16.535836 | 293 |
#plotting the clusters
plt.figure(figsize=(10, 8))
sns.scatterplot(x="mpg", y="hp", hue="cluster_K",
data=cc_z1,
palette=['green','red']);
cDataH.clusters_H.value_counts().sort_index()
1    100
2    298
Name: clusters_H, dtype: int64
cDataK.cluster_K.value_counts().sort_index()
0    105
1    293
Name: cluster_K, dtype: int64
cDataH.shape
(398, 14)
cData.head()
| | cyl | yr | mpg | disp | hp | wt | acc | origin_america | origin_asia | origin_europe | mpg_level_high | mpg_level_low | mpg_level_medium |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 70 | 18.0 | 307.0 | 130.0 | 3504 | 12.0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 8 | 70 | 15.0 | 350.0 | 165.0 | 3693 | 11.5 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 8 | 70 | 18.0 | 318.0 | 150.0 | 3436 | 11.0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | 8 | 70 | 16.0 | 304.0 | 150.0 | 3433 | 12.0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | 8 | 70 | 17.0 | 302.0 | 140.0 | 3449 | 10.5 | 1 | 0 | 0 | 0 | 0 | 1 |
X = cData.drop(['mpg','origin_europe','mpg_level_low'], axis=1)
# the dependent variable
y = cData[['mpg']]
# Split X and y into training and test set in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=15)
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)
LinearRegression()
for idx, col_name in enumerate(X_train.columns):
    print("The coefficient for {} is {}".format(col_name, regression_model.coef_[0][idx]))
The coefficient for cyl is -0.5134441386218939
The coefficient for yr is 0.44346504291168265
The coefficient for disp is 0.01068885839464703
The coefficient for hp is 0.010315514536313882
The coefficient for wt is -0.004538788568737155
The coefficient for acc is 0.1918342560886259
The coefficient for origin_america is -1.7306209513689055
The coefficient for origin_asia is -0.8976724344009429
The coefficient for mpg_level_high is 8.552374663817028
The coefficient for mpg_level_medium is 1.5941218694850463
intercept = regression_model.intercept_[0]
print("The intercept for our model is {}".format(intercept))
The intercept for our model is -1.6635717568649753
regression_model.score(X_train, y_train)
0.8967703023839786
O=regression_model.score(X_test, y_test)
O
0.9037421476349176
#renaming the cluster labels to light and heavy vehicles and creating dummy variables from them
cDataK['cluster_K']=cDataK['cluster_K'].astype('category')
cDataK['cluster_K'] = cDataK['cluster_K'].replace({1: 'heavy', 0: 'light'})
cDataK = pd.get_dummies(cDataK, columns=['cluster_K'])
cDataK.head()
| | cyl | yr | mpg | disp | hp | wt | acc | origin_america | origin_asia | origin_europe | mpg_level_high | mpg_level_low | mpg_level_medium | cluster_K_heavy | cluster_K_light |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 70 | 18.0 | 307.0 | 130.0 | 3504 | 12.0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 8 | 70 | 15.0 | 350.0 | 165.0 | 3693 | 11.5 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2 | 8 | 70 | 18.0 | 318.0 | 150.0 | 3436 | 11.0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 3 | 8 | 70 | 16.0 | 304.0 | 150.0 | 3433 | 12.0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 4 | 8 | 70 | 17.0 | 302.0 | 140.0 | 3449 | 10.5 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
X = cDataK.drop(['mpg','origin_europe','mpg_level_low','cluster_K_light'], axis=1)
# the dependent variable
y = cDataK[['mpg']]
# Split X and y into training and test set in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=12)
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)
LinearRegression()
for idx, col_name in enumerate(X_train.columns):
    print("The coefficient for {} is {}".format(col_name, regression_model.coef_[0][idx]))
The coefficient for cyl is -1.1945995644777487
The coefficient for yr is 0.4318651041505978
The coefficient for disp is 0.017477496279109928
The coefficient for hp is -0.0101380458359059
The coefficient for wt is -0.004068430169386406
The coefficient for acc is 0.18564828746249984
The coefficient for origin_america is -1.6918315494304044
The coefficient for origin_asia is -0.740777919230301
The coefficient for mpg_level_high is 9.28312093915688
The coefficient for mpg_level_medium is 2.250001714231246
The coefficient for cluster_K_heavy is -2.511514014338475
intercept = regression_model.intercept_[0]
print("The intercept for our model is {}".format(intercept))
The intercept for our model is 3.7156608210557103
regression_model.score(X_train, y_train)
0.8942370456543635
K=regression_model.score(X_test, y_test)
K
0.9117893808052382
#renaming the cluster labels to light and heavy vehicles and creating dummy variables from them
cDataH['clusters_H']=cDataH['clusters_H'].astype('category')
cDataH['clusters_H'] = cDataH['clusters_H'].replace({1: 'heavy', 2: 'light'})
cDataH = pd.get_dummies(cDataH, columns=['clusters_H'])
X = cDataH.drop(['mpg','origin_europe','mpg_level_low','clusters_H_light'], axis=1)
# the dependent variable
y = cDataH[['mpg']]
# Split X and y into training and test set in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=10)
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)
LinearRegression()
for idx, col_name in enumerate(X_train.columns):
    print("The coefficient for {} is {}".format(col_name, regression_model.coef_[0][idx]))
intercept = regression_model.intercept_[0]
print("The intercept for our model is {}".format(intercept))
The intercept for our model is 2.57272931823265
regression_model.score(X_train, y_train)
0.8988409890950728
H=regression_model.score(X_test, y_test)
H
0.901023837384672
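The questions above also ask for a linear regression fitted on each cluster separately, with coefficients printed individually; the models above instead add the cluster label as a dummy feature. A minimal sketch of the per-cluster approach on synthetic stand-in data (in the notebook, `df` would be `Kclus` and the feature list would include all predictors):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: two clusters whose mpg~wt relationships differ
rng = np.random.default_rng(0)
wt = rng.uniform(1600, 5100, 200)
cluster_K = (wt < 3000).astype(int)          # 1 = light, 0 = heavy
mpg = np.where(cluster_K == 1, 42 - 0.005 * wt, 26 - 0.003 * wt)
df = pd.DataFrame({'wt': wt, 'mpg': mpg + rng.normal(0, 1, 200), 'cluster_K': cluster_K})

# Fit one model per cluster and print its coefficients individually
models = {}
for label, grp in df.groupby('cluster_K'):
    m = LinearRegression().fit(grp[['wt']], grp['mpg'])
    models[label] = m
    print(f"cluster {label}: coef={m.coef_}, intercept={m.intercept_:.2f}")
```

Each cluster gets its own slope and intercept, so a heavy-vehicle model is not forced to share coefficients with a light-vehicle model; the single-model-with-dummy approach above only shifts the intercept per cluster.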
# Linear regression on the original dataset achieves a test score of nearly 90%
# Adding the K-Means cluster label achieves about 91%
# Adding the hierarchical cluster label achieves about 90%
The cluster-augmented models differ from the baseline by only about 1%. To get more clarity, a larger dataset could be used; with the features mentioned above it may then be possible to achieve higher accuracy or better explainability of the models and their variables.
• DOMAIN: Manufacturing
• CONTEXT: Company X curates and packages wine across various vineyards spread throughout the country.
• DATA DESCRIPTION: The data concerns the chemical composition of the wine and its respective quality.
Attribute Information:
A, B, C, D: specific chemical composition measure of the wine
Quality: quality of wine [ Low and High ]
• PROJECT OBJECTIVE: Goal is to build a synthetic data generation model using the existing data provided by the company.
Steps and tasks:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import zscore
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.metrics import confusion_matrix, classification_report,accuracy_score
import seaborn as sns
%matplotlib inline
sns.set(color_codes=True)
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
#importing the data
wi=pd.ExcelFile("https://lms-uploads.s3.ap-southeast-1.amazonaws.com/account_1/attachments/1982220/Company.xlsx?response-content-disposition=attachment%3B%20filename%3D%22Part2%20-%20Company.xlsx%22%3B%20filename%2A%3DUTF-8%27%27Part2%2520%252D%2520Company.xlsx&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIA434J2A76JGUD2PBI%2F20210731%2Fap-southeast-1%2Fs3%2Faws4_request&X-Amz-Date=20210731T054433Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=2544b0d8e22386d364138302ba93ca80db35303390a313fc0511da6874070639",engine='openpyxl')
wi.sheet_names
wi = wi.parse("Sheet1")
wi.head()
| | A | B | C | D | Quality |
|---|---|---|---|---|---|
| 0 | 47 | 27 | 45 | 108 | Quality A |
| 1 | 174 | 133 | 134 | 166 | Quality B |
| 2 | 159 | 163 | 135 | 131 | NaN |
| 3 | 61 | 23 | 3 | 44 | Quality A |
| 4 | 59 | 60 | 9 | 68 | Quality A |
row, column = wi.shape
print('The dataset contains', row, 'rows and', column, 'columns')
The dataset contains 61 rows and 5 columns
#make a copy of the data
wi1=wi.copy()
#dropping the target variable
wi=wi.drop('Quality',axis=1)
#checking for data types
print('The data type of each attribute: \n')
wi1.info()
The data type of each attribute:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   A        61 non-null     int64
 1   B        61 non-null     int64
 2   C        61 non-null     int64
 3   D        61 non-null     int64
 4   Quality  43 non-null     object
dtypes: int64(4), object(1)
memory usage: 2.5+ KB
print('The target variable has 18 missing values \n')
wi1.isnull().sum()
The target variable has 18 missing values

A           0
B           0
C           0
D           0
Quality    18
dtype: int64
#checking the distribution of the data points
wi.hist(bins = 10, figsize = (10, 8), color = 'red')
plt.show()
#checking for the density of the variable
plt.figure(figsize=(10, 8))
col = 1
for i in wi.columns:
    plt.subplot(2, 2, col)
    sns.distplot(wi[i], color = 'green')
    col += 1
sns.pairplot(wi, kind="reg")
plt.show()
#scaling the numeric variables
w1 = wi.apply(zscore)
w1.head()
| | A | B | C | D |
|---|---|---|---|---|
| 0 | -1.168034 | -1.561080 | -1.061569 | -0.103138 |
| 1 | 0.904992 | 0.284923 | 0.306077 | 0.823013 |
| 2 | 0.660147 | 0.807376 | 0.321443 | 0.264129 |
| 3 | -0.939512 | -1.630740 | -1.706975 | -1.125099 |
| 4 | -0.972158 | -0.986381 | -1.614775 | -0.741864 |
#checking the distribution of the target variable
wi1['Quality'].value_counts()
Quality A    26
Quality B    17
Name: Quality, dtype: int64
#applying kmeans with 2 centroids
k_means = KMeans(n_clusters = 2)
k_means.fit(w1)
labels = k_means.labels_
# Calculating silhouette_score
silhouette_score(w1,labels)
0.6891674125195145
#attaching the cluster labels to the original dataset
wi1["Quality_k"] = labels
#checking for the first 15 observations
wi1.head(15)
| | A | B | C | D | Quality | Quality_k |
|---|---|---|---|---|---|---|
| 0 | 47 | 27 | 45 | 108 | Quality A | 1 |
| 1 | 174 | 133 | 134 | 166 | Quality B | 0 |
| 2 | 159 | 163 | 135 | 131 | NaN | 0 |
| 3 | 61 | 23 | 3 | 44 | Quality A | 1 |
| 4 | 59 | 60 | 9 | 68 | Quality A | 1 |
| 5 | 153 | 140 | 154 | 199 | NaN | 0 |
| 6 | 34 | 28 | 78 | 22 | Quality A | 1 |
| 7 | 191 | 144 | 143 | 154 | NaN | 0 |
| 8 | 160 | 181 | 194 | 178 | Quality B | 0 |
| 9 | 145 | 178 | 158 | 141 | NaN | 0 |
| 10 | 76 | 94 | 107 | 37 | Quality A | 1 |
| 11 | 138 | 200 | 153 | 192 | Quality B | 0 |
| 12 | 27 | 106 | 57 | 37 | Quality A | 1 |
| 13 | 45 | 76 | 56 | 57 | Quality A | 1 |
| 14 | 186 | 142 | 146 | 174 | NaN | 0 |
#renaming the cluster labels
wi1['Quality_k'] = wi1['Quality_k'].replace({1: 'Quality A', 0: 'Quality B'})
#checking for datatypes
wi1.dtypes
A int64 B int64 C int64 D int64 Quality object Quality_k object dtype: object
#dropping na rows
wi=wi1.dropna()
#checking for the shape
wi.shape
(43, 6)
#using the cluster labels and actual target labels to check for accuracy
pred = wi['Quality_k']
actual=wi['Quality']
#visualization of confusion matrix in the form of a heatmap
cm= confusion_matrix(actual, pred)
plt.figure(figsize = (12, 8))
sns.heatmap(cm, annot = True, cmap = 'RdYlGn', fmt = 'd')
plt.xlabel('Predicted Classes', fontsize = 10)
plt.ylabel('Actual Classes', fontsize = 10)
plt.title('Confusion Matrix HeatMap', fontsize = 10);
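The heatmap shows agreement counts between the actual labels and the cluster-derived labels; the overall agreement rate can be read off the matrix diagonal. A toy sketch (the labels below are made up for illustration, not the actual data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# made-up actual vs cluster-derived labels
actual = np.array(['Quality A', 'Quality A', 'Quality B', 'Quality B'])
pred = np.array(['Quality A', 'Quality B', 'Quality B', 'Quality B'])

cm = confusion_matrix(actual, pred, labels=['Quality A', 'Quality B'])
# accuracy = correctly matched labels (diagonal) over all labels
accuracy = np.trace(cm) / cm.sum()
print(accuracy)  # 3 of the 4 labels agree
```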
• DOMAIN: Automobile
• CONTEXT: The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette.
The vehicle may be viewed from one of many different angles.
• DATA DESCRIPTION: The data contains features extracted from the silhouettes of vehicles at different angles. Four "Corgie" model vehicles were used for the experiment: a double decker bus, Chevrolet van, Saab 9000 and Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, van and either one of the cars would be readily distinguishable, but it would be more difficult to distinguish between the cars. • All the features are numeric, i.e. geometric features extracted from the silhouette.
• PROJECT OBJECTIVE: Apply dimensionality reduction technique – PCA and train a model using principal components instead of training the
model using just the raw data.
Steps and tasks:
Data: Import, clean and pre-process the data
EDA and visualisation: Create a detailed performance report using univariate, bi-variate and multivariate EDA techniques. Find out all possible hidden patterns.
Classifier: Design and train a best fit SVM classifier using all the data attributes.
Dimensional reduction: perform dimensionality reduction on the data.
Classifier: Design and train a best fit SVM classifier using the dimensionally reduced attributes.
Conclusion: Showcase key pointers on how dimensionality reduction helped in this case.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report,accuracy_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import confusion_matrix
from sklearn import metrics
%matplotlib inline
sns.set(color_codes=True)
import warnings
warnings.filterwarnings('ignore')
vehicle = pd.read_csv("https://lms-uploads.s3.ap-southeast-1.amazonaws.com/account_1/attachments/257361/vehicle.csv?response-content-disposition=attachment%3B%20filename%3D%22Part3%20-%20vehicle.csv%22%3B%20filename%2A%3DUTF-8%27%27Part3%2520%252D%2520vehicle.csv&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIA434J2A76JGUD2PBI%2F20210731%2Fap-southeast-1%2Fs3%2Faws4_request&X-Amz-Date=20210731T054443Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=71c01fdf7f785cd49ff9ab0432383899554d862a04803617af8dda00e3e0c94c")
vehicle.head()
| | compactness | circularity | distance_circularity | radius_ratio | pr.axis_aspect_ratio | max.length_aspect_ratio | scatter_ratio | elongatedness | pr.axis_rectangularity | max.length_rectangularity | scaled_variance | scaled_variance.1 | scaled_radius_of_gyration | scaled_radius_of_gyration.1 | skewness_about | skewness_about.1 | skewness_about.2 | hollows_ratio | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 95 | 48.0 | 83.0 | 178.0 | 72.0 | 10 | 162.0 | 42.0 | 20.0 | 159 | 176.0 | 379.0 | 184.0 | 70.0 | 6.0 | 16.0 | 187.0 | 197 | van |
| 1 | 91 | 41.0 | 84.0 | 141.0 | 57.0 | 9 | 149.0 | 45.0 | 19.0 | 143 | 170.0 | 330.0 | 158.0 | 72.0 | 9.0 | 14.0 | 189.0 | 199 | van |
| 2 | 104 | 50.0 | 106.0 | 209.0 | 66.0 | 10 | 207.0 | 32.0 | 23.0 | 158 | 223.0 | 635.0 | 220.0 | 73.0 | 14.0 | 9.0 | 188.0 | 196 | car |
| 3 | 93 | 41.0 | 82.0 | 159.0 | 63.0 | 9 | 144.0 | 46.0 | 19.0 | 143 | 160.0 | 309.0 | 127.0 | 63.0 | 6.0 | 10.0 | 199.0 | 207 | van |
| 4 | 85 | 44.0 | 70.0 | 205.0 | 103.0 | 52 | 149.0 | 45.0 | 19.0 | 144 | 241.0 | 325.0 | 188.0 | 127.0 | 9.0 | 11.0 | 180.0 | 183 | bus |
rows, column = vehicle.shape
print('The dataset contains', rows, 'rows and', column, 'columns.')
The dataset contains 846 rows and 19 columns.
#checking for the data type
print('The data type of each attribute: \n')
vehicle.info()
The data type of each attribute: <class 'pandas.core.frame.DataFrame'> RangeIndex: 846 entries, 0 to 845 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 compactness 846 non-null int64 1 circularity 841 non-null float64 2 distance_circularity 842 non-null float64 3 radius_ratio 840 non-null float64 4 pr.axis_aspect_ratio 844 non-null float64 5 max.length_aspect_ratio 846 non-null int64 6 scatter_ratio 845 non-null float64 7 elongatedness 845 non-null float64 8 pr.axis_rectangularity 843 non-null float64 9 max.length_rectangularity 846 non-null int64 10 scaled_variance 843 non-null float64 11 scaled_variance.1 844 non-null float64 12 scaled_radius_of_gyration 844 non-null float64 13 scaled_radius_of_gyration.1 842 non-null float64 14 skewness_about 840 non-null float64 15 skewness_about.1 845 non-null float64 16 skewness_about.2 845 non-null float64 17 hollows_ratio 846 non-null int64 18 class 846 non-null object dtypes: float64(14), int64(4), object(1) memory usage: 125.7+ KB
print('Duplicated rows:',vehicle[vehicle.duplicated()].shape[0])
Duplicated rows: 0
print('Null values:\n',vehicle.isnull().sum())
Null values: compactness 0 circularity 5 distance_circularity 4 radius_ratio 6 pr.axis_aspect_ratio 2 max.length_aspect_ratio 0 scatter_ratio 1 elongatedness 1 pr.axis_rectangularity 3 max.length_rectangularity 0 scaled_variance 3 scaled_variance.1 2 scaled_radius_of_gyration 2 scaled_radius_of_gyration.1 4 skewness_about 6 skewness_about.1 1 skewness_about.2 1 hollows_ratio 0 class 0 dtype: int64
There appear to be a few missing values.
#replacing the missing values with median values
for cols in vehicle.columns:
    if(cols != 'class'):
        vehicle[cols] = vehicle[cols].fillna(vehicle[cols].median())
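The loop above fills each numeric column's gaps with its median. The same treatment can be expressed with scikit-learn's `SimpleImputer`, which is convenient inside pipelines; a minimal sketch with a made-up column (the numbers are illustrative, not the real vehicle data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy column with a gap, standing in for one vehicle attribute
df = pd.DataFrame({'radius_ratio': [104.0, np.nan, 167.0, 195.0]})

imputer = SimpleImputer(strategy='median')
df[['radius_ratio']] = imputer.fit_transform(df[['radius_ratio']])
print(df['radius_ratio'].tolist())  # NaN replaced by the median of the observed values
```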
vehicle.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| compactness | 846.0 | 93.678487 | 8.234474 | 73.0 | 87.00 | 93.0 | 100.00 | 119.0 |
| circularity | 846.0 | 44.823877 | 6.134272 | 33.0 | 40.00 | 44.0 | 49.00 | 59.0 |
| distance_circularity | 846.0 | 82.100473 | 15.741569 | 40.0 | 70.00 | 80.0 | 98.00 | 112.0 |
| radius_ratio | 846.0 | 168.874704 | 33.401356 | 104.0 | 141.00 | 167.0 | 195.00 | 333.0 |
| pr.axis_aspect_ratio | 846.0 | 61.677305 | 7.882188 | 47.0 | 57.00 | 61.0 | 65.00 | 138.0 |
| max.length_aspect_ratio | 846.0 | 8.567376 | 4.601217 | 2.0 | 7.00 | 8.0 | 10.00 | 55.0 |
| scatter_ratio | 846.0 | 168.887707 | 33.197710 | 112.0 | 147.00 | 157.0 | 198.00 | 265.0 |
| elongatedness | 846.0 | 40.936170 | 7.811882 | 26.0 | 33.00 | 43.0 | 46.00 | 61.0 |
| pr.axis_rectangularity | 846.0 | 20.580378 | 2.588558 | 17.0 | 19.00 | 20.0 | 23.00 | 29.0 |
| max.length_rectangularity | 846.0 | 147.998818 | 14.515652 | 118.0 | 137.00 | 146.0 | 159.00 | 188.0 |
| scaled_variance | 846.0 | 188.596927 | 31.360427 | 130.0 | 167.00 | 179.0 | 217.00 | 320.0 |
| scaled_variance.1 | 846.0 | 439.314421 | 176.496341 | 184.0 | 318.25 | 363.5 | 586.75 | 1018.0 |
| scaled_radius_of_gyration | 846.0 | 174.706856 | 32.546277 | 109.0 | 149.00 | 173.5 | 198.00 | 268.0 |
| scaled_radius_of_gyration.1 | 846.0 | 72.443262 | 7.468734 | 59.0 | 67.00 | 71.5 | 75.00 | 135.0 |
| skewness_about | 846.0 | 6.361702 | 4.903244 | 0.0 | 2.00 | 6.0 | 9.00 | 22.0 |
| skewness_about.1 | 846.0 | 12.600473 | 8.930962 | 0.0 | 5.00 | 11.0 | 19.00 | 41.0 |
| skewness_about.2 | 846.0 | 188.918440 | 6.152247 | 176.0 | 184.00 | 188.0 | 193.00 | 206.0 |
| hollows_ratio | 846.0 | 195.632388 | 7.438797 | 181.0 | 190.25 | 197.0 | 201.00 | 211.0 |
#plotting the distribution of the numerical variables
vehicle.hist(bins = 20, figsize = (20, 18), color = 'red')
plt.show()
#Density plots
plt.figure(figsize=(20, 18))
col = 1
for i in vehicle.drop(columns='class').columns:
    plt.subplot(4, 5, col)
    sns.distplot(vehicle[i], color = 'green')
    col += 1
#checking the distribution of the class variable
print(vehicle['class'].value_counts())
plt.title('Count of Vehicle Class column')
sns.countplot(x = 'class', data = vehicle);
car 429 bus 218 van 199 Name: class, dtype: int64
#boxplots for outliers
plt.figure(figsize=(25,23))
col = 1
for i in vehicle.drop(columns='class').columns:
    plt.subplot(5, 4, col)
    sns.boxplot(vehicle[i], color='blue')
    col += 1
#Boxplots against the class with other numeric variables
fig,axs = plt.subplots(4,5,figsize=(20,18))
sns.boxplot(x="class", y="compactness",data=vehicle,ax=axs[0,0]);
sns.boxplot(x="class", y="circularity",data=vehicle,ax=axs[0,1]);
sns.boxplot(x="class", y="distance_circularity",data=vehicle,ax=axs[0,2]);
sns.boxplot(x="class", y="radius_ratio",data=vehicle,ax=axs[0,3]);
sns.boxplot(x="class", y="pr.axis_aspect_ratio",data=vehicle,ax=axs[0,4]);
sns.boxplot(x="class", y="max.length_aspect_ratio",data=vehicle,ax=axs[1,0]);
sns.boxplot(x="class", y="scatter_ratio",data=vehicle,ax=axs[1,1]);
sns.boxplot(x="class", y="elongatedness",data=vehicle,ax=axs[1,2]);
sns.boxplot(x="class", y="pr.axis_rectangularity",data=vehicle,ax=axs[1,3]);
sns.boxplot(x="class", y="max.length_rectangularity",data=vehicle,ax=axs[1,4]);
sns.boxplot(x="class", y="scaled_variance",data=vehicle,ax=axs[2,0]);
sns.boxplot(x="class", y="scaled_variance.1",data=vehicle,ax=axs[2,1]);
sns.boxplot(x="class", y="scaled_radius_of_gyration",data=vehicle,ax=axs[2,2]);
sns.boxplot(x="class", y="scaled_radius_of_gyration.1",data=vehicle,ax=axs[2,3]);
sns.boxplot(x="class", y="skewness_about",data=vehicle,ax=axs[2,4]);
sns.boxplot(x="class", y="skewness_about.1",data=vehicle,ax=axs[3,0]);
sns.boxplot(x="class", y="skewness_about.2",data=vehicle,ax=axs[3,1]);
sns.boxplot(x="class", y="hollows_ratio",data=vehicle,ax=axs[3,2]);
fig.tight_layout()
#find the outliers and replace them by median
for col_name in vehicle.drop(columns = 'class').columns:
    q1 = vehicle[col_name].quantile(0.25)
    q3 = vehicle[col_name].quantile(0.75)
    iqr = q3 - q1
    low = q1 - 1.5 * iqr
    high = q3 + 1.5 * iqr
    vehicle.loc[(vehicle[col_name] < low) | (vehicle[col_name] > high), col_name] = vehicle[col_name].median()
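An alternative to replacing outliers with the median is to cap them at the IQR whisker limits, which preserves their rank ordering. A sketch on a made-up series with one obvious outlier (the values are illustrative):

```python
import pandas as pd

# made-up column with one obvious outlier
s = pd.Series([10, 12, 11, 13, 12, 95])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# clip() caps values at the whisker limits instead of replacing them with the median
capped = s.clip(lower=low, upper=high)
print(capped.tolist())
```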
#boxplot after outlier treatment
plt.figure(figsize=(25,23))
col = 1
for i in vehicle.drop(columns='class').columns:
    plt.subplot(5, 4, col)
    sns.boxplot(vehicle[i], color='red')
    col += 1
#checking for correlation
plt.figure(figsize=(20,18))
corr=vehicle.corr()
sns.heatmap(corr,annot=True);
#splitting the data to check for correlation with the class variable
X = vehicle.loc[:, vehicle.columns != 'class']
y = vehicle['class'].astype('category').cat.codes
#plotting the correlation with target variable
plt.figure(figsize = (15, 8))
ax = sns.barplot(x = X.columns, y = X.corrwith(y))
plt.title('Correlation with Class column', fontsize = 40)
x=plt.setp(ax.get_xticklabels(), rotation=90)
#scaling the numeric variables
XScaled=X.apply(zscore)
XScaled.head()
| | compactness | circularity | distance_circularity | radius_ratio | pr.axis_aspect_ratio | max.length_aspect_ratio | scatter_ratio | elongatedness | pr.axis_rectangularity | max.length_rectangularity | scaled_variance | scaled_variance.1 | scaled_radius_of_gyration | scaled_radius_of_gyration.1 | skewness_about | skewness_about.1 | skewness_about.2 | hollows_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.160580 | 0.518073 | 0.057177 | 0.300945 | 1.933135 | 0.912212 | -0.207598 | 0.136262 | -0.224342 | 0.758332 | -0.400771 | -0.337407 | 0.285705 | -0.315806 | -0.032330 | 0.387162 | -0.312012 | 0.183957 |
| 1 | -0.325470 | -0.623732 | 0.120741 | -0.850666 | -0.740596 | 0.427456 | -0.599423 | 0.520519 | -0.610886 | -0.344578 | -0.594220 | -0.618623 | -0.513630 | 0.009122 | 0.624090 | 0.161740 | 0.013265 | 0.452977 |
| 2 | 1.254193 | 0.844303 | 1.519141 | 1.265808 | 0.863642 | 0.912212 | 1.148719 | -1.144597 | 0.935290 | 0.689401 | 1.114582 | 1.131806 | 1.392477 | 0.171586 | 1.718123 | -0.401818 | -0.149374 | 0.049447 |
| 3 | -0.082445 | -0.623732 | -0.006386 | -0.290423 | 0.328896 | 0.427456 | -0.750125 | 0.648605 | -0.610886 | -0.344578 | -0.916635 | -0.739145 | -1.466683 | -1.453054 | -0.032330 | -0.289106 | 1.639649 | 1.529056 |
| 4 | -1.054545 | -0.134387 | -0.769150 | 1.141310 | -0.027601 | -0.057300 | -0.599423 | 0.520519 | -0.610886 | -0.275646 | 1.694930 | -0.647319 | 0.408680 | -0.072110 | 0.624090 | -0.176395 | -1.450481 | -1.699181 |
#plotting the cumulative variance explained by the principal components
pca = PCA()
X_pca_ = pca.fit_transform(XScaled)
plt.figure(figsize = (12, 8))
plt.plot((np.cumsum(pca.explained_variance_ratio_) * 100), marker = 'X')
plt.xlim(0, 18)
plt.xlabel('Number of Components')
plt.ylabel('Percentage of Cumulative Explained Variance');
print(pca.explained_variance_)
[9.74940269e+00 3.35071912e+00 1.19238155e+00 1.13381916e+00 8.83997312e-01 6.66265745e-01 3.18150910e-01 2.28179142e-01 1.31018595e-01 7.98619108e-02 7.33979478e-02 6.46162669e-02 4.01448646e-02 3.22758478e-02 2.93936408e-02 2.27005257e-02 1.98136761e-02 5.16287320e-03]
#plotting the cumulative explained variance as a step plot
plt.figure(figsize = (12, 8))
plt.step(list(range(18)), (np.cumsum(pca.explained_variance_ratio_) * 100), where = 'mid')
plt.xlim(0, 18)
plt.xlabel('Number of Components')
plt.ylabel('Percentage of Cumulative Explained Variance')
plt.title('Vehicle Dataset Explained Variance');
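Rather than reading a component count off the plot, `PCA` also accepts a variance target directly when `n_components` is a float. A minimal sketch on scikit-learn's built-in iris data (an illustrative stand-in, not the vehicle features):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # 150 rows, 4 features

# ask PCA for however many components are needed to retain at least 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # fewer columns than the original 4
```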
#Using 6 components and printing the eigen vectors
pca3 = PCA(n_components=6)
pca3.fit(XScaled)
print(pca3.components_)
print(pca3.explained_variance_ratio_)
Xpca3 = pca3.transform(XScaled)
[[ 0.27250289 0.28725469 0.30242111 0.26971354 0.09786073 0.19520014 0.31052393 -0.3090069 0.307287 0.27815416 0.29976509 0.30553237 0.26323762 -0.04193594 0.03608321 0.05872048 0.03801314 0.08474 ] [-0.08704358 0.13162176 -0.04614301 -0.19793126 -0.25783995 -0.10804563 0.07528535 -0.01322994 0.0875602 0.12215424 0.07726575 0.07150302 0.21058205 0.50362158 -0.01576632 -0.09274624 -0.50162122 -0.50761211] [-0.03818521 -0.20114691 0.06346211 0.05628517 -0.06199275 -0.14895782 0.10904283 -0.09085269 0.1060705 -0.21368469 0.1445998 0.11034374 -0.20287019 0.07386402 -0.55917399 0.6706805 -0.06224071 -0.04170535] [ 0.13867501 -0.03805548 0.10895429 -0.25435509 -0.61276572 0.27867816 0.00539295 0.06521486 0.03089915 0.04146747 -0.06400509 -0.00219687 -0.08553965 -0.11539962 0.47370331 0.42842603 -0.0274096 0.09603749] [ 0.13710146 -0.13899555 -0.08001743 0.13374437 0.12360146 -0.63489335 0.08555746 -0.07907344 0.08164639 -0.25111294 0.14747122 0.11010098 -0.00521211 0.13806861 0.56655224 0.13086982 0.1805193 -0.11078807] [ 0.26361139 -0.07134743 -0.01690062 -0.13818366 -0.57782861 -0.289097 0.0977471 -0.0757283 0.10540321 -0.07819621 0.13291241 0.11539823 -0.0670574 -0.13151308 -0.31917609 -0.46840497 0.28013643 0.05944441]] [0.54099325 0.18593103 0.06616512 0.0629155 0.04905291 0.03697101]
#printing the original features and the reduced features
pca_6 = PCA(n_components = 6)
X_pca = pca_6.fit_transform(XScaled)
print('Original number of features:', X.shape[1])
print('Reduced number of features:', X_pca.shape[1])
Original number of features: 18 Reduced number of features: 6
#viewing the first 5 observations of the pca components
pca_df = pd.DataFrame(data = X_pca)
pca_df.head()
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.584228 | -0.675673 | -0.453334 | -0.750656 | -0.777515 | -1.848809 |
| 1 | -1.512180 | -0.348934 | -0.333436 | 1.268953 | -0.324929 | -0.118317 |
| 2 | 3.913448 | 0.234507 | -1.265094 | 0.137224 | 0.915751 | -0.685594 |
| 3 | -1.535193 | -3.044413 | -0.469623 | 0.324317 | -0.611590 | 0.367777 |
| 4 | -0.642062 | 1.488882 | -0.246288 | -0.550939 | 0.471655 | -1.012697 |
sns.pairplot(pca_df, diag_kind = 'kde');
#splitting the original data into train and test 70:30
X_train, X_test, y_train, y_test = train_test_split(XScaled, y, test_size = 0.3, random_state = 10)
rtr, ctr = X_train.shape
print('The training set comprises of', rtr, 'rows and', ctr, 'columns.')
The training set comprises of 592 rows and 18 columns.
#splitting the pca data into train and test 70:30
X_tr, X_te, y_tr, y_te = train_test_split(X_pca, y, test_size = 0.3, random_state = 10)
rtr_pca, ctr_pca = X_tr.shape
print('The PCA training set comprises of', rtr_pca, 'rows and', ctr_pca, 'columns.')
The PCA training set comprises of 592 rows and 6 columns.
rt_pca, ct_pca = X_te.shape
print('The PCA test set comprises of', rt_pca, 'rows and', ct_pca, 'columns.')
The PCA test set comprises of 254 rows and 6 columns.
# Building a Support Vector Machine on train data
svc_model = SVC(C= 4, kernel='rbf', gamma='scale')
svc_model.fit(X_train, y_train)
SVC(C=4)
#predicting on train data
sv_train_predict = svc_model.predict(X_train)
print("Model Accuracy on train: {0:.4f}".format(metrics.accuracy_score(y_train, sv_train_predict)))
print()
Model Accuracy on train: 0.9899
#predicting on test data
sv_test_predict = svc_model.predict(X_test)
print("Model Accuracy on test: {0:.4f}".format(metrics.accuracy_score(y_test, sv_test_predict)))
print()
Model Accuracy on test: 0.9685
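The `C=4` RBF setting above was picked manually; a cross-validated grid search is a common way to automate that choice. A sketch on scikit-learn's built-in iris data (the dataset, split and grid values are illustrative assumptions, not a tuned recommendation for the vehicle data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=10)

# candidate values to try; each combination is scored with 5-fold cross-validation
param_grid = {'C': [0.1, 1, 4, 10], 'gamma': ['scale', 'auto']}
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
grid.fit(X_tr, y_tr)

print(grid.best_params_)
print(round(grid.score(X_te, y_te), 4))
```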
#visualization of confusion matrix in the form of a heatmap
cm= confusion_matrix(y_test, sv_test_predict)
plt.figure(figsize = (12, 8))
sns.heatmap(cm, annot = True, cmap = 'RdYlGn', fmt = 'd')
plt.xlabel('Predicted Classes', fontsize = 15)
plt.ylabel('Actual Classes', fontsize = 15)
plt.title('Confusion Matrix for SVM', fontsize = 15);
#printing classification report
print("Classification Report")
print(metrics.classification_report(y_test, sv_test_predict, labels=[0,1,2]))
Classification Report
precision recall f1-score support
0 1.00 0.99 0.99 71
1 0.98 0.96 0.97 125
2 0.90 0.97 0.93 58
accuracy 0.97 254
macro avg 0.96 0.97 0.97 254
weighted avg 0.97 0.97 0.97 254
precision_SV, recall_SV, f1_score_SV, support = precision_recall_fscore_support(y_test, sv_test_predict,average='macro')
print('Precision Score :', '%0.2f' % precision_SV)
print('Recall Score :', '%0.2f' % recall_SV)
print('F1-Score:', '%0.2f' % f1_score_SV)
SV_Acc= accuracy_score(y_test, sv_test_predict)
print('Accuracy Score :','%0.2f' % SV_Acc)
Precision Score : 0.96 Recall Score : 0.97 F1-Score: 0.97 Accuracy Score : 0.97
#SVM on the pca data
svc_model_pca = SVC(C= 4, kernel='rbf', gamma='scale')
svc_model_pca.fit(X_tr, y_tr)
SVC(C=4)
#predicting on train data
sv_tr_predict = svc_model_pca.predict(X_tr)
print("Model Accuracy on train: {0:.4f}".format(metrics.accuracy_score(y_tr, sv_tr_predict)))
print()
Model Accuracy on train: 0.9476
1) The SVM trained on all 18 scaled attributes achieved 98.99% accuracy on train and 96.85% on test. 2) The SVM trained on the 6 principal components achieved 94.76% accuracy on train. Both models fit the data well, and PCA cut the feature count from 18 to 6 at the cost of only a few points of accuracy.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from scipy.stats import zscore
from sklearn.metrics import silhouette_samples, silhouette_score
import seaborn as sns
%matplotlib inline
sns.set(color_codes=True)
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
cricket=pd.read_csv('https://lms-uploads.s3.ap-southeast-1.amazonaws.com/account_1/attachments/91254/batting_bowling_ipl_bat.csv?response-content-disposition=attachment%3B%20filename%3D%22Part4%20-%20batting_bowling_ipl_bat.csv%22%3B%20filename%2A%3DUTF-8%27%27Part4%2520%252D%2520batting%255Fbowling%255Fipl%255Fbat.csv&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIA434J2A76JGUD2PBI%2F20210731%2Fap-southeast-1%2Fs3%2Faws4_request&X-Amz-Date=20210731T125534Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=3be5f8ec58edec4a78b6690ef2c8abbaa1511a55aa5d28185e1aca10784f2b7a')
cricket = cricket.dropna(axis=0)
cricket.head()
| | Name | Runs | Ave | SR | Fours | Sixes | HF |
|---|---|---|---|---|---|---|---|
| 1 | CH Gayle | 733.0 | 61.08 | 160.74 | 46.0 | 59.0 | 9.0 |
| 3 | G Gambhir | 590.0 | 36.87 | 143.55 | 64.0 | 17.0 | 6.0 |
| 5 | V Sehwag | 495.0 | 33.00 | 161.23 | 57.0 | 19.0 | 5.0 |
| 7 | CL White | 479.0 | 43.54 | 149.68 | 41.0 | 20.0 | 5.0 |
| 9 | S Dhawan | 569.0 | 40.64 | 129.61 | 58.0 | 18.0 | 5.0 |
rows, column = cricket.shape
print('The dataset contains', rows, 'rows and', column, 'columns.')
The dataset contains 90 rows and 7 columns.
print('The data type of each attribute: \n')
cricket.info()
The data type of each attribute: <class 'pandas.core.frame.DataFrame'> Int64Index: 90 entries, 1 to 179 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Name 90 non-null object 1 Runs 90 non-null float64 2 Ave 90 non-null float64 3 SR 90 non-null float64 4 Fours 90 non-null float64 5 Sixes 90 non-null float64 6 HF 90 non-null float64 dtypes: float64(6), object(1) memory usage: 5.6+ KB
cricket.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Runs | 90.0 | 219.933333 | 156.253669 | 2.00 | 98.000 | 196.500 | 330.7500 | 733.00 |
| Ave | 90.0 | 24.729889 | 13.619215 | 0.50 | 14.665 | 24.440 | 32.1950 | 81.33 |
| SR | 90.0 | 119.164111 | 23.656547 | 18.18 | 108.745 | 120.135 | 131.9975 | 164.10 |
| Fours | 90.0 | 19.788889 | 16.399845 | 0.00 | 6.250 | 16.000 | 28.0000 | 73.00 |
| Sixes | 90.0 | 7.577778 | 8.001373 | 0.00 | 3.000 | 6.000 | 10.0000 | 59.00 |
| HF | 90.0 | 1.188889 | 1.688656 | 0.00 | 0.000 | 0.500 | 2.0000 | 9.00 |
print('Duplicated rows: ', cricket[cricket.duplicated()].shape[0])
Duplicated rows: 0
plt.figure(figsize=(15,8))
b1=cricket.groupby('Name')['Runs'].sum().sort_values(ascending = False ).head(10)
b1= b1.reset_index()
b1.columns = ['Name' ,'Runs']
sns.barplot(data= b1 , x= 'Name' , y ='Runs');
plt.title("Top 10 Players by Runs");
plt.figure(figsize=(15,8))
b1=cricket.groupby('Name')['Ave'].sum().sort_values(ascending = False ).head(10)
b1= b1.reset_index()
b1.columns = ['Name' ,'Ave']
sns.barplot(data= b1 , x= 'Name' , y ='Ave')
plt.title("Top 10 players by Average");
plt.figure(figsize=(15,8))
b1=cricket.groupby('Name')['SR'].sum().sort_values(ascending = False ).head(10)
b1= b1.reset_index()
b1.columns = ['Name' ,'SR']
sns.barplot(data= b1 , x= 'Name' , y ='SR')
plt.title("Top 10 players by Strike Rate");
plt.figure(figsize=(15,8))
b1=cricket.groupby('Name')['Sixes'].sum().sort_values(ascending = False ).head(10)
b1= b1.reset_index()
b1.columns = ['Name' ,'Sixes']
sns.barplot(data= b1 , x= 'Name' , y ='Sixes')
plt.title("Top 10 players by Sixes");
plt.figure(figsize=(15,8))
b1=cricket.groupby('Name')['HF'].sum().sort_values(ascending = False ).head(10)
b1= b1.reset_index()
b1.columns = ['Name' ,'HF']
sns.barplot(data= b1 , x= 'Name' , y ='HF')
plt.title("Top 10 players by Half Centuries");
cricket.hist(bins = 20, figsize = (15, 10), color = 'blue')
plt.show()
plt.figure(figsize=(15, 12))
col = 1
for i in cricket.drop(columns='Name').columns:
    plt.subplot(2, 3, col)
    sns.distplot(cricket[i], color = 'green')
    col += 1
plt.figure(figsize=(15, 10))
col = 1
for i in cricket.drop(columns='Name').columns:
    plt.subplot(2, 3, col)
    sns.boxplot(cricket[i], color='red')
    col += 1
There appear to be outliers; we will not treat them, as it is highly likely that these are genuine observations.
import seaborn as sns
sns.pairplot(cricket, kind="reg") # plots scatter plots for every pair of attributes and histograms along the diagonal
plt.show()
#checking for correlation
plt.figure(figsize=(10,8))
corr=cricket.drop(columns='Name').corr()
sns.heatmap(corr,annot=True);
#scaling the numeric variables
cc = cricket.iloc[:,1:7]
cc1 = cc.apply(zscore)
cc1.head()
| | Runs | Ave | SR | Fours | Sixes | HF |
|---|---|---|---|---|---|---|
| 1 | 3.301945 | 2.683984 | 1.767325 | 1.607207 | 6.462679 | 4.651551 |
| 3 | 2.381639 | 0.896390 | 1.036605 | 2.710928 | 1.184173 | 2.865038 |
| 5 | 1.770248 | 0.610640 | 1.788154 | 2.281703 | 1.435530 | 2.269533 |
| 7 | 1.667276 | 1.388883 | 1.297182 | 1.300618 | 1.561209 | 2.269533 |
| 9 | 2.246490 | 1.174755 | 0.444038 | 2.343021 | 1.309851 | 2.269533 |
#checking for the within sum of squares
wss =[]
for i in range(1,6):
KM = KMeans(n_clusters=i)
KM.fit(cc1)
wss.append(KM.inertia_)
wss
[540.0000000000001, 296.26135354732025, 223.36518460916952, 178.1750796759041, 144.68430051153075]
#plotting to check for the optimal number of clusters
plt.plot(range(1,6), wss);
plt.title('Elbow Method');
plt.xlabel("Number of Clusters")
plt.ylabel("WSS");
#using 2 centroids
k_means = KMeans(n_clusters = 2)
k_means.fit(cc1)
labels = k_means.labels_
# Calculating silhouette_score
silhouette_score(cc1,labels)
0.41111085574076756
#plotting silhouette score for different centroids
kmeans_kwargs = {
"init": "random",
"n_init": 10,
"max_iter": 300,
"random_state": 42,
}
silhouette_coefficients = []
# Notice you start at 2 clusters for silhouette coefficient
for k in range(2, 6):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(cc1)
    score = silhouette_score(cc1, kmeans.labels_)
    silhouette_coefficients.append(score)
plt.plot(range(2,6), silhouette_coefficients)
plt.xticks(range(2, 6))
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Coefficient")
plt.show()
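The project brief also asks for hierarchical clustering alongside K-Means. A minimal sketch using Ward linkage on synthetic two-group data (the blobs are illustrative stand-ins for the scaled batting stats):

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# synthetic two-group data standing in for the scaled batting stats
X, _ = make_blobs(n_samples=90, centers=[[-5, -5], [5, 5]],
                  cluster_std=1.0, random_state=0)

# Ward linkage builds the full merge tree; fcluster cuts it into 2 clusters
Z = linkage(X, method='ward')
labels = fcluster(Z, t=2, criterion='maxclust')
print(sorted(set(labels)))  # two cluster ids
```

Unlike K-Means, the linkage tree can also be drawn as a dendrogram (`scipy.cluster.hierarchy.dendrogram(Z)`) to inspect candidate cluster counts visually.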
#attaching the labels to the original data
cricket['cluster']=labels
cricket.head()
| | Name | Runs | Ave | SR | Fours | Sixes | HF | cluster |
|---|---|---|---|---|---|---|---|---|
| 1 | CH Gayle | 733.0 | 61.08 | 160.74 | 46.0 | 59.0 | 9.0 | 0 |
| 3 | G Gambhir | 590.0 | 36.87 | 143.55 | 64.0 | 17.0 | 6.0 | 0 |
| 5 | V Sehwag | 495.0 | 33.00 | 161.23 | 57.0 | 19.0 | 5.0 | 0 |
| 7 | CL White | 479.0 | 43.54 | 149.68 | 41.0 | 20.0 | 5.0 | 0 |
| 9 | S Dhawan | 569.0 | 40.64 | 129.61 | 58.0 | 18.0 | 5.0 | 0 |
#viewing the distribution of the clusters
cricket.cluster.value_counts().sort_index()
0 35 1 55 Name: cluster, dtype: int64
#aggregating the numeric variables by cluster using their mean
aggdata = cricket.drop(columns='Name').groupby('cluster').mean()
aggdata['Freq'] = cricket.cluster.value_counts().sort_index()
aggdata
| cluster | Runs | Ave | SR | Fours | Sixes | HF | Freq |
|---|---|---|---|---|---|---|---|
| 0 | 372.657143 | 35.962286 | 132.808286 | 34.685714 | 13.771429 | 2.714286 | 35 |
| 1 | 122.745455 | 17.582000 | 110.481455 | 10.309091 | 3.636364 | 0.218182 | 55 |
#based on the above table, ranking the players: cluster 0 (higher runs, average, strike rate and boundaries) as Grade A and cluster 1 as Grade B
cricket['cluster'] = cricket['cluster'].replace({0: 'Grade A', 1: 'Grade B'})
#list of Grade A players
Grade_A = cricket[cricket['cluster'] == 'Grade A']
Grade_A.head()
| | Name | Runs | Ave | SR | Fours | Sixes | HF | cluster |
|---|---|---|---|---|---|---|---|---|
| 1 | CH Gayle | 733.0 | 61.08 | 160.74 | 46.0 | 59.0 | 9.0 | Grade A |
| 3 | G Gambhir | 590.0 | 36.87 | 143.55 | 64.0 | 17.0 | 6.0 | Grade A |
| 5 | V Sehwag | 495.0 | 33.00 | 161.23 | 57.0 | 19.0 | 5.0 | Grade A |
| 7 | CL White | 479.0 | 43.54 | 149.68 | 41.0 | 20.0 | 5.0 | Grade A |
| 9 | S Dhawan | 569.0 | 40.64 | 129.61 | 58.0 | 18.0 | 5.0 | Grade A |
#list of Grade B players
Grade_B = cricket[cricket['cluster'] == 'Grade B']
Grade_B.head(10)
| | Name | Runs | Ave | SR | Fours | Sixes | HF | cluster |
|---|---|---|---|---|---|---|---|---|
| 69 | MS Bisla | 213.0 | 30.42 | 133.12 | 16.0 | 10.0 | 1.0 | Grade B |
| 73 | BJ Hodge | 245.0 | 30.62 | 140.00 | 18.0 | 9.0 | 0.0 | Grade B |
| 75 | NV Ojha | 255.0 | 23.18 | 113.83 | 21.0 | 13.0 | 1.0 | Grade B |
| 77 | DB Das | 126.0 | 42.00 | 135.48 | 9.0 | 6.0 | 0.0 | Grade B |
| 79 | AC Gilchrist | 172.0 | 34.40 | 120.27 | 21.0 | 4.0 | 1.0 | Grade B |
| 81 | BB McCullum | 289.0 | 24.08 | 102.12 | 37.0 | 3.0 | 1.0 | Grade B |
| 83 | IK Pathan | 176.0 | 25.14 | 139.68 | 14.0 | 6.0 | 0.0 | Grade B |
| 85 | Azhar Mahmood | 186.0 | 23.25 | 130.98 | 16.0 | 8.0 | 0.0 | Grade B |
| 87 | MK Pandey | 143.0 | 20.42 | 127.67 | 12.0 | 6.0 | 1.0 | Grade B |
| 89 | S Badrinath | 196.0 | 28.00 | 108.28 | 23.0 | 2.0 | 1.0 | Grade B |
Clustering the IPL batting data splits the batsmen into two grades, separated clearly by runs, average, strike rate, fours, sixes and half-centuries.
List of all possible dimensionality reduction techniques and when to use them
Dimensionality reduction techniques can be classified into three broad types: feature-selection based, linear projection based, and non-linear (manifold) based.
1) Feature-selection based techniques:
A) Missing Value Ratio: if the dataset has too many missing values, we use this approach to reduce the number of variables. We can drop the variables having a large number of missing values.
B) Low Variance Filter: we apply this approach to identify and drop near-constant variables from the dataset. Variables with very low variance carry little information about the target, and hence can be safely dropped.
C) High Correlation Filter: a pair of variables having high correlation increases multicollinearity in the dataset. So, we can use this technique to find highly correlated features and drop one of each pair accordingly.
D) Random Forest: this is one of the most commonly used techniques, which tells us the importance of each feature present in the dataset. We can rank the features by importance and keep only the top ones, resulting in dimensionality reduction.
E) Backward Feature Elimination and Forward Feature Selection: both techniques take a lot of computational time and are thus generally used on smaller datasets.
2) Linear projection techniques:
A) Factor Analysis: This technique is best suited for situations where we have a highly correlated set of variables. It divides the variables into different groups based on their correlation, and represents each group with a factor.
B) Principal Component Analysis: This is one of the most widely used techniques for dealing with linear data. It divides the data into a set of components which try to explain as much variance as possible.
C) Independent Component Analysis: We can use ICA to transform the data into independent components which describe the data using a smaller number of components.
3) Non-linear (manifold) techniques:
A) ISOMAP: We use this technique when the data is strongly non-linear.
B) t-SNE: This technique also works well when the data is strongly non-linear. It works extremely well for visualizations as well.
C) UMAP: This technique works well for high dimensional data. Its run-time is shorter as compared to t-SNE.
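As a quick sketch of the Low Variance Filter described above, scikit-learn's `VarianceThreshold` drops constant columns (the toy matrix below is made up for illustration):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# toy matrix whose middle column is constant (zero variance)
X = np.array([[1.0, 7.0, 3.0],
              [2.0, 7.0, 1.0],
              [3.0, 7.0, 2.0]])

selector = VarianceThreshold(threshold=0.0)  # drop zero-variance features
X_sel = selector.fit_transform(X)
print(X_sel.shape)  # the constant column is gone
```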
from sklearn.datasets import load_digits
digits = load_digits()
digits.images.shape
(1797, 8, 8)
# visualize the first hundred of these
import matplotlib.pyplot as plt
fig, axes = plt.subplots(10, 10, figsize=(8, 8),
                         subplot_kw={'xticks':[], 'yticks':[]},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary', interpolation='nearest')
    ax.text(0.05, 0.05, str(digits.target[i]),
            transform=ax.transAxes, color='red')
#two-dimensional,[n_samples, n_features] representation
X = digits.data
X.shape
(1797, 64)
y = digits.target
y.shape
(1797,)
#transform the data to two dimensions using manifold learning algorithm called Isomap
from sklearn.manifold import Isomap
iso = Isomap(n_components=2)
iso.fit(digits.data)
data_projected = iso.transform(digits.data)
data_projected.shape
(1797, 2)
# plot this data to see if we can learn anything from its structure
plt.figure(figsize=(10, 8))
plt.scatter(data_projected[:, 0], data_projected[:, 1], c=digits.target,edgecolor='none', alpha=0.5,cmap=plt.cm.get_cmap('magma_r', 10))
plt.colorbar(label='digit label', ticks=range(10))
plt.clim(-0.5, 9.5);
from sklearn.model_selection import train_test_split
# Split X and y into training and test set in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=15)
#fit a Gaussian naive Bayes model
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
pred = model.predict(X_test)
# gauge its accuracy by comparing the true values of the test set to the predictions
from sklearn.metrics import accuracy_score
accuracy_score(y_test, pred)
0.8407407407407408
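The 84% accuracy above uses the raw 64 pixel features. In the spirit of this dimensionality reduction section, the same classifier can instead be fed an Isomap embedding inside a pipeline; `n_components=10` below is an illustrative choice, not a tuned value:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
                                                    random_state=15)

# reduce 64 pixels to a 10-dimensional Isomap embedding, then classify
model = make_pipeline(Isomap(n_components=10), GaussianNB())
model.fit(X_train, y_train)
print(round(model.score(X_test, y_test), 4))
```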
from sklearn.metrics import confusion_matrix
import seaborn as sns
%matplotlib inline
sns.set(color_codes=True)
#visualization of confusion matrix in the form of a heatmap
plt.figure(figsize = (12, 8))
mat = confusion_matrix(y_test, pred)
sns.heatmap(mat, square=True, annot=True, cbar=False)
plt.xlabel('predicted value')
plt.ylabel('Actual value');
plt.title('Confusion Matrix HeatMap', fontsize = 15);
#plot the first 100 test digits with their predicted labels: green if correct, red if wrong
fig, axes = plt.subplots(10, 10, figsize=(8, 8),
                         subplot_kw={'xticks':[], 'yticks':[]},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i, ax in enumerate(axes.flat):
    ax.imshow(X_test[i].reshape(8, 8), cmap='binary', interpolation='nearest')
    ax.text(0.05, 0.05, str(pred[i]),
            transform=ax.transAxes,
            color='green' if (y_test[i] == pred[i]) else 'red')